\section{Introduction}
\label{sec:intro}
%Present the domain, applications, and motivation for this work.
Industrial-grade software development is a complex activity involving
many long-running, non-trivial projects carried out by communities of
skilled professionals and individuals. To help understand the organisation
and interactions within such communities, researchers have mostly relied on
repository data (logs, files, mails, issues, minutes, etc.), either exported into SQL
database archives like those proposed by the FLOSSMole project~\cite{flossmole06},
or stored in a database of their own, whatever the technology.
Once collected and cleansed, the data is analysed with various tools.
For source code management (SCM) data, tools like CVSAnaly2\footnote{\url{http://github.com/MetricsGrimoire/CVSAnalY}} from
the FLOSSMetrics project\footnote{\url{http://flossmetrics.org}} are used to extract 
information out of the repository logs and store it in a database. 

However, although these database archives are full of useful information,
they have drawbacks with respect to their structure, the freshness of their data
and their purpose of use. Most data archives are stored in SQL databases with
predetermined schemas that are difficult to evolve. A prominent
example is the FLOSSMetrics schema template, instantiated in many studies
and in tools like CVSAnaly2. Whether they reuse it directly or through tools that implement it,
researchers often end up modifying the schema to fit their specific
research goal~\cite{goeminne13}, which defeats the purpose of a common schema. Another prominent
example is the set of 15 schemas proposed by FLOSSMole~\cite{flossmoleschema2012},
whose heterogeneity hinders comprehensive studies on a homogeneous basis.

The other main issue is keeping these data archives fresh,
given the constantly evolving nature of software development activities.
Combined with the cost of storage space, this is a real challenge,
and many archives quickly become outdated or cover only limited time spans.
The impressive data collection of FLOSSMole, started in 2004 and counting more than
1 TB from many different software forges, is selectively updated on a user-donation
basis. The SourceForge Research Data Archive~\cite{Van-Antwerp:2008} (SRDA) is the best maintained
and most up-to-date data collection available online. It follows its own maintenance process,
run by the University of Notre Dame on behalf of SourceForge,
and has its own SQL schema of 73 tables. Yet the contract with SourceForge does not allow
the maintainers to distribute the data, so queries must be performed online via a form.
Timeouts or unchecked errors may occur, and the archive may be unavailable due
to disk failures.

In summary, the above approaches suffer from several limitations.
They all feature static data archive dumps, which are probably costly to maintain
and are selectively updated, depending on the capability and availability of the people
who voluntarily donate the data and their time. Their SQL schemas may evolve,
but researchers have no upstream control over them. Therefore, the kinds of queries
that can be made on the data, and the structure of the data itself, are bound to those schemas.

The approach we propose in this paper, MSR4J, is meant to provide researchers with
an extensible framework with which they can build their own data collection
for analysis, provided they have access to the data sources they want to study.
MSR4J is bound to the Java language and relies on a data model in the style of
graph databases, which greatly facilitates extension as well as incremental and continuous
updates. Any database back end could be used. As a guiding example of how
MSR4J could be used, we explored the projects of the Apache Software Foundation (ASF).
We used the organisation of these projects, two recent studies on the ASF, and
the CVSAnaly2 data schema as starting points to design the first data model of MSR4J.

The remainder of this paper is organised as follows. Section~\ref{sec:sota}
discusses work related to the design of the data model presented in Section~\ref{sec:datam}
and of the MSR4J architecture presented in Section~\ref{sec:archi}. Section~\ref{sec:limi} analyses
the limitations of this first iteration in the design of MSR4J. Section~\ref{sec:application}
shows the use of MSR4J for the case study of the ASF. Finally, Section~\ref{sec:conc}
sketches future improvements and extensions to this study.
